
document that watcher stops (see #273) and add each_with_retry #275

Open

wants to merge 1 commit into base: master
Conversation

@grosser (Contributor) commented Nov 9, 2017

Nicer alternative to the current loop { watch } we are using. Could also make retry: true an option for each, but I wanted to keep it clean.

@simon3z

/cc @jonmoter
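For illustration only (this is not the PR's actual diff): a minimal sketch of a retrying wrapper built around such a "done" flag, assuming watcher.each returns true when the watch was stopped deliberately (e.g. via .finish) and false when the connection dropped, and that calling each again re-opens the connection.

```ruby
# Sketch only, not the PR's implementation. Assumes `watcher.each` returns
# true when the watch was finished on purpose (e.g. watcher.finish) and false
# when the connection dropped, and that calling `each` again reconnects.
def each_with_retry(watcher, &block)
  loop do
    done = watcher.each(&block)
    break if done   # stopped deliberately, don't reconnect
    sleep 1         # connection dropped: back off briefly, then re-watch
  end
end
```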

@simon3z (Collaborator) commented Nov 9, 2017

As mentioned on issue #273, I don't think that logic like each_with_retry belongs in kubeclient.
👍 for the documentation.

@grosser, on the specifics of the code: what is the new done flag used for?

cc @moolitayer @cben

Review comment on the diff (context lines: "end" / "done") from a Collaborator:

The only way I see this can be false is if the connection was interrupted before at least one complete line was received?

@grosser (Contributor, Author) commented Nov 9, 2017 via email

@grosser (Contributor, Author) commented Nov 27, 2017

OK to merge this?
I'd think having a stable/reliable watcher is a very common use case; there can be many reasons a connection gets interrupted, and blindly retrying breaks when also using .finish.

@zeari left a comment

This is a little non-standard, since obj.each usually returns obj, but I'm OK with these changes 👍

@grosser (Contributor, Author) commented Jan 4, 2018

@simon3z I can haz merge ?

@simon3z (Collaborator) commented Jan 4, 2018

@simon3z I can haz merge ?

cc @moolitayer @cben

@cben (Collaborator) commented Jan 22, 2018

Hi, I'm very much aware I owe a review here.
Meanwhile I've been trying to figure out how to restart precisely. I recently asked on kubernetes/website#6540 (comment), and IIUC the answers, it's officially possible to take the last seen resourceVersion of an individual object in a collection during a watch, and start a new collection watch from that version. (One should still deal with the possibility of a 410 error, meaning the version is too old.)

Also, @pkliczewski now told me the official Go client's "informer framework" apparently solves this; I haven't checked it yet.

So, about this PR:

  • WatchStream here remains version-agnostic. If the user created it via the watch_* functions with a resource_version: arg, it will have the version baked into the URL; if not, it won't.
  • each_with_retry with a specific starting version is not good: it means you'll get the same data again.
    Generally, a restart with overlap is problematic to consume: assuming you want to ignore out-of-date updates, those are hard to recognize, because comparing two versions is not allowed.

IMHO restart logic does belong in kubeclient, but we'd need something more version-aware.
@grosser what do you think, how would you like to continue this?

@grosser (Contributor, Author) commented Jan 22, 2018

so "remove resource_version from subsequent calls"
or "keep track of resource_version and pass it when retrying" ?
... first seems reasonable/easy, second would be nice, but seems a bit scary

@qingling128 commented

We are having similar issues where the connection is closed and the watcher crashes instead of retrying.

+1 for having some logic in the kubeclient to:

  1. Retry when connection is closed.
  2. Keep track of resource_version and pass it when retrying.
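For concreteness, a rough sketch of those two behaviors combined. The wrapper itself is hypothetical; it only assumes kubeclient's watch_pods accepts a resource_version: option and yields notices with object.metadata.resourceVersion.

```ruby
# Hypothetical wrapper, not part of kubeclient: reconnect on disconnect and
# carry the last seen resourceVersion forward so the watch resumes rather
# than replaying from the beginning.
def watch_pods_with_resume(client)
  version = nil
  loop do
    client.watch_pods(resource_version: version).each do |notice|
      version = notice.object.metadata.resourceVersion  # remember progress
      yield notice
    end
    # each returned without the caller breaking => the connection was closed
    sleep 1  # brief backoff before reconnecting from `version`
  end
end
```

(As discussed above, resuming from an old version can still fail with 410 Gone, which this sketch does not handle.)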

cben added a commit to cben/kubeclient that referenced this pull request Feb 12, 2020
need some way to distinguish "exited intentionally" (.finish/break/return)
from connection closed.
@cben (Collaborator) commented Feb 27, 2020

I quite forgot this PR. Putting resumption aside, there was a good point here that .each should distinguish deliberate exit — by break / return / .finish — from forced connection closure.

I'm by now aware of 5 ways watching can exit:

  1. .finish => exits cleanly ✔️
  2. break / return => exits cleanly ✔️
  3. error such as 410 Gone => does not raise any exception, instead passes a "notice" with ERROR type into the block, then exits cleanly 🤦‍♂️
    High request rate towards Kubernetes API server fabric8io/fluent-plugin-kubernetes_metadata_filter#213
  4. connection closed by server => may exit cleanly with no exception 😕
  5. connection closed by server/middleboxes(?)/network problems => may raise HTTP::ConnectionError.
    Fluentd Crash fabric8io/fluent-plugin-kubernetes_metadata_filter#194
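To make the list above concrete, this is roughly how the common loop { watch } pattern behaves in cases (3) through (5); client here stands for a configured Kubeclient::Client.

```ruby
loop do
  client.watch_pods.each do |notice|
    # (3) shows up here as data: notice.type == "ERROR", after which the
    #     stream ends without raising anything
    puts notice.type
  end
  # (4) reaches this point: the server closed the connection, each returned
  #     cleanly, and the loop silently re-watches
  # (5) never reaches this point: HTTP::ConnectionError propagates out of
  #     each and breaks the loop
end
```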

The technical reason is that we have a handle_exception helper for the RestClient code paths but forgot to do any error handling for the HTTP code paths 🤣

(4) vs (5) is clearly a mess. I'm tempted to say the user doesn't care how exactly a watch got disconnected — the bottom line is that the "infinite" loop got broken, and the user needs to resume/retry in the same way.

If backward compatibility is not a question, I'd argue (3) (4) (5) should all throw an exception.
(and not leak the HTTP library's exception — users shouldn't need to be aware of the precise HTTP libraries kubeclient uses — better to wrap it in our Kubeclient::HttpError).
It was an unplanned accident that (4) may exit cleanly; from the caller's POV the call promises an "infinite" loop, and IMO any disconnection should be an in-your-face error, so people understand it's something they must handle.

The solution here of returning a boolean has the benefit of backward compatibility.
There is existing code out there calling watch and handling resumption, and it's written assuming the clean exit (4)... though most of it will fail in case of error (5)?

But a boolean doesn't cover (3) well. I mean, the current behavior of passing error data into the block is ... workable, but not elegant. And handling 410 is essential to correct usage if you pass resourceVersion!

A "meta" consideration: as we've seen here, there is value in accumulating expereince of how things actually break, and if we just swallow errors into a boolean, there will be no details in logs, so we (kubeclient) will not learn, and users will not learn.

I propose to break compatibility (and bump major version) and turn (3) (4) (5) into Kubeclient errors.

@grosser @qingling128 @jcantrill @mszabo @Stono @fradee what do you think?
Would that be a step forward? Or too disruptive?

[And yes, I still hope in some future version we'll get Kubeclient itself resuming after disconnection. IFF initial resourceVersion was set. And it will still leave caller to handle 410, so I think we need to fix the interface first anyway.]

[I'm also working on adding some docs on watching, because README is really lacking...]
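For a sense of what that proposal would mean for callers, a hedged sketch; it assumes (3) through (5) raise Kubeclient::HttpError, which is not what current releases do.

```ruby
# Hypothetical caller code under the proposal that (3)-(5) raise
# Kubeclient::HttpError; current kubeclient releases do not behave this way.
version = nil  # or a resourceVersion obtained from a previous list
begin
  client.watch_pods(resource_version: version).each do |notice|
    puts notice.type
  end
rescue Kubeclient::HttpError => e
  if e.error_code == 410
    # resourceVersion too old: re-list to get a fresh version, then re-watch
  else
    # disconnect or other server-side error: back off and re-watch
  end
end
```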

@mszabo commented Feb 27, 2020

@cben Thanks for working on this. I think it would be a step forward.

@Stono commented Feb 28, 2020

Be aware that if you're watching streams that don't change frequently, you'll see 410 Gone quite a bit when you reconnect.

Examples are, say, CRDs that are rarely updated. When you reconnect you'll likely get a 410 Gone if you're trying to resume from the last seen CRD, and you'll need to resume from 0 instead, dropping any ADDED events with a timestamp <= the start of your stream.

@grosser (Contributor, Author) commented Feb 29, 2020

Raising on (3)/(4)/(5) sounds like an improvement over silently stopping the loop; it should be easy for client code to see/catch.

I'd still like a resume: true option (or make it the default, though that risks loops) or similar, which takes the resourceVersion of the incoming objects and then reconnects with the latest version when disconnected.

@qingling128 commented

Re #275 (comment):
That does sound like a step forward to me. We did run into an issue (fluent-plugin-kubernetes_metadata_filter/pull/214) where an exception was not thrown as expected.

@cben (Collaborator) commented Mar 11, 2020

Confirmed: the (3) {"type": "ERROR", ...} notices are returned with a 200 status 😐
Turns out it's deliberate in kubernetes per kubernetes/kubernetes#25151:

Either way, for web sockets if we don't fix this clients can't get this
error (the browser eats the 410 during the websocket negotiation and client
code can't get access). So it's at least broken for web sockets.

This is unfortunate for :raw mode — if we don't parse the input, we can't raise an exception.

  • Could document this exception?
  • Or could add heuristic parsing when the raw text contains ERROR, just to check whether to raise an exception, otherwise discarding the parse results.
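A minimal sketch of that second option. The heuristic is hypothetical (kubeclient does not implement it); it assumes the Kubeclient::HttpError.new(error_code, message, response) signature.

```ruby
require 'json'
require 'kubeclient'

# Hypothetical heuristic for :raw mode: cheaply check whether a raw watch line
# looks like an ERROR notice, parse it only in that case to decide whether to
# raise, and otherwise hand the raw line through untouched.
def raise_if_error_notice(raw_line)
  return unless raw_line.include?('"ERROR"')   # cheap pre-filter
  notice = JSON.parse(raw_line) rescue nil     # discard parse on failure
  return unless notice && notice['type'] == 'ERROR'
  status = notice['object'] || {}
  raise Kubeclient::HttpError.new(status['code'], status['message'], nil)
end
```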

@rockb1017 commented

Thank you for your work and the great discussions! Any plan for this to be merged?
